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Abstract 

Big data are data on a massive scale in terms of volume, intensity, and complexity 
that exceed the capacity of standard analytic tools. They present opportunities as 
well as challenges to statisticians. The role of computational statisticians in scientific 
discovery from big data analyses has been under-recognized even by peer statisticians. 

This article summarizes recent methodological and software developments in statistics 
that address the big data challenges. Methodologies are grouped into three classes: 
subsampling-based, divide and conquer, and online updating for stream data. As 
a new contribution, the online updating approach is extended to variable selection 
with commonly used criteria, and their performances are assessed in a simulation 
study with stream data. Software packages are summarized with focuses on the 
open source R and R packages, covering recent tools that help break the barriers of 
computer memory and computing power. Some of the tools are illustrated in a case 
study with a logistic regression for the chance of airline delay. 

Key WORDS: bootstrap; divide and conquer; external memory algorithm; high perfor¬ 
mance computing; online update; sampling; software; 


1 Introduction 


A 2011 McKinsey repo rt predicted shortage of talent necessary for organizations to take 
advantage of big data flManvika et al.l . l201ll) . Data now stream from da ily life thank s to 
technological advances, and big data has indeed become a big deal (e.g., Shaw . 2ni4h . In 
the President’s Corner of the June 2013 issue of AMStat News, the three presidents (elect, 
current, and past) of the A merican Statist i cal A ssociation (ASA) wrote an article titled 
“The A SA and Bi g Data ” f Schenker et ah . 2013 ). This article echos the June 2012 col¬ 
umn of iRodriguezI (120121) on the recent media focus on big data, and discusses on what 
the statistics profession needs to do in response to the fact that statistics and statisti¬ 
cians are missing from big data discussions. In the followup July 2013 column, president 
Marie Davidian further raised the issues of statistics not being recognized as d ata science 
and m ainstream academic statisticians being left behind by the rise of big data flDavidianl . 
20131) . A white paper prepared by a working group of the ASA called for more ambitious 
efforts from statisticians to work together with researchers in ot her fields on nation al re¬ 
search priorities in order to achieve better science more quickly (IRndin et ahl . I2014J ). The 
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same concern was ex pres s ed in a 2014 president’s address of the Institnte of Mathemat¬ 
ical Statistics (IMS) (IYiI 2ni4J ). President Bin Yu of the IMS called for statisticians to 
own Data Science by working on real problems such as those from genomics, neuroscience, 
astronomy, nanoscience, computational social science, personalized medicine/healthcare, 
hnance, and government; relevant methodology/theory will follow naturally. 

Big data in the media or the business world m ay mean differently than what are familiar 
to academic statisticians ( Jordan and Linl. 2014 ). Big data are data on a massive scale in 
terms of volume, intensity, and c omplexity that excee d the ability of standard software 
tools to manage and analyze (e.g., ISniiders et al.l. |2012[) . The orig in of the term “big data” 
as it is understood today has been traced back in a recent study ( Dieboldl. 2012[) to lunch- 
table convers ations a t Silic on Graphics in the mid-1990s, in which John Mashey hgured 
prominently ( Mashev . 1998[l . Big data are generated by countless online interactions among 
people, transactions between people and systems, and sensor-enabled machinery. Internet 
search engines (e.g., Google and YouTube) and social network tools (e.g., Facebook and 
Twitter) generate billions of activity data per day. Rather than Gigabytes and Terabytes, 
nowa days, the data produ ced are estimated by zettabytes, and are growing 40% every 


day (Fan and Bifet . 20131) . In the big data analytics world, a 3V dehnition by LanevI 


(I 2 OOII) is widely accepted: volume (amount of data), velocity (speed of data in and out), 
and variety (range of data types and sources). High variety brings nontraditional or even 
unstructured data types, such as social network sentiments and internet map usage, which 
calls for new, creative w ays to understand the structure of data and even to ask intelligent 
research questions le.g.. iJordan and Linl. l2014l) . High volume and high velocity may bring 
noise accumulation, spurious correlation and inci dental homogene ity, creating issues in 


computational feasibility and algorithmic stability ( Fan et ah . 2014|) . 


Notwithstanding that new statistical thinking and methods are needed for the high 
variety aspect of big data, our focus is on htting standard statistical models to big data 
whose size exceeds the capacity of a single computer from its high volume and high ve¬ 
locity. There are two computational barriers for big data analysis: 1) the data can be 
too big to hold in a computer’s memory; and 2) the computing task can take too long to 
wait for the results. These barriers can be approached either with newly developed sta¬ 
tistical methodologies and/or computational methodologies. Despite the impression that 
statisticians are left behind in media discussions or governmental summits on big data, 
some statisticians have made important contributions and are pushing the frontier. Sound 


nosed (Jordan. 

2013 

1 . Examples are subsamnling-based annroaches (Kleiner et ah. 2014: 

Ma et ah. 

2013; 

Liane et ah 

. 12013: Maclaurin and Adams. 2014). divide and conquer an- 

proaches ( 

Lin and X 

. 2011: 

Chen and Xie. 2014: Song and Liang. 20141: Neiswanger et ah. 

2013'). and online updatine approaches (Schifano et ah. 2015). From a computational per- 


sp ective, much ef f ort has been put into the most active, open source statistical environment, 
R ( R Core Teaml . 2014al) . S tatistician R d evelopers are relentless in their drive to extend 
the reach of R into big data ( Rickert . 2013h . Recent UseR! conferences had many presenta¬ 
tions that directly addressed big data, including a 2014 keynote lecture by John Chambers, 
the inventor of the S language (jChambersl . 120141) . Most cutting edge methods are hrst 
and easily implemented in R. Given the open source nature of R and the active recent 
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development, our focus on software for big data will be on R and R packages. Revolution 
R Enterprise (RRE) is a commercialized version of R, but it offers free academic use, so it 
is also included in our case study and benchmarked. Other commercial software such as 
SAS, SPSS, and MATLAB will be briefly touched upon for completeness. 

The rest of the article is organized as follows. Recent methodological developments in 
statistics on big data are summarized in Section [2J Updating formulas for commonly used 
variable selection criteria in the online setting are developed and their performances studied 
in a simulation study in Section [3l Resources from open source software R for analyzing 
big data with classical models are summarized in Section |H Commercial software products 
are presented in Section [5l A case study on a logistic model for the chance of airline delay 
is presented in Section El A discussion concludes in Section [71 


2 Statistical Methods 

The recent methodologies for big data can be loosely grouped into three categories: resampling- 
based, divide and conquer, and online updating. To put the different methods in a context, 
consider a dataset with n independent and identically distributed observations, where n is 
too big for standard statistical routines such as logistic regression. 


2.1 Subsampling-Based Methods 

2.1.1 Bags of Little Bootstrap 


Kleiner et ah! (120141) proposed the bags of little bootstrap (BLB) approach that provides 
both point estimates and qua lity measures such as variance or conhdence int ervals. It is a 


comb ination of subsampli ng (Politis et ah . IQOOh . the m-out-of-n bootstrap ( Bickel et al.l. 


19971) . and the bootstrap flEfronl. Il979l) to achieve computational efficiency. BLB consists 
of the following steps. First, draw s subsamples of size m from the original data of size n. 
For each of the s subsets, draw r bootstrap samples of size n instead of m, and obtain the 
point estimates and their quality measures (e.g., conhdence interval) from the r bootstrap 
sample. Then, the s bootstrap point estimates and quality measures are combined (e.g., by 
average) to yield the overall point estimates and quality measures. In summary, BLB has 
two nested procedures: the inner procedure applies the bootstrap to a subsample, and the 
outer procedure combines these mul tiple bootstrap estim ates. The subsample size m was 
suggested to be with 7 G [0.5,1] flKleiner et al.l. l2014j) . a much smaller number than n. 
Although the inner bootstrap procedure conceptually generates multiple resampled data of 
size n, what is really needed in the storage and computation is a sample of size m with a 
weight vector. In contrast to subsampling and the m-out-of-n bootstrap, there is no need 
for an analytic correction (e.g., to rescale the conhdence intervals from the hnal 

result. The BLB procedure facilitates distri buted comput i ng by letting each subsample of 
size m be processed by a separate processor. Kleiner et ah ( 20141) proved the consistency of 
BLB and provided high order correctness. Their simulation study showed good accuracy, 
convergence rate and the remarkable computational efficiency. 
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2.1.2 Leveraging 


Ma and Sun! fl2014j) proposed to use leveraging to facilitate scientific discoveries from big 
data using limited computing resources. In a leveraging method, one samples a small pro¬ 
portion of the data with certain weights (subsample) from the full sample, and then per¬ 
forms intended computations for the full sample using the small subsample as a surrogate. 
The key to success of the leveraging methods is to construct the weights, the nonuniform 
sampling probab ilities, so that influential data points are sampled with high probabilities 
( Ma et ahl. l2013l) . Leveraging methods are different from the traditional subsampling or 
m-out-of-n bootstrap in that 1) they are used to achieve feasible computation even if the 
simple analytic results are available; 2) they enable visualization of the data when visualiza¬ 
tion of the full sample is impossible; and 3) they usually use unequal sampling probabilities 
for subsampling data. This approach is quite unique in allowing pervasive access to extract 
information from big data without resorting to high performance computing. 


2.1.3 Mean Log-likelihood 


Liang et al.l (120131) proposed a resampling-based stochastic approximation approach with 
an application to big geostatistical data. The method uses Monte Carlo averages calculated 
from subsamples to approximate the quantities needed for the full data. Motivated from 
minimizing the Kullback-Leibler (KL) divergence, they approximate the KL divergence 
by averages calculated from subsamples. This leads to a maximum mean log-likelihood 
estimation method. The solution to the mean score equation is obtained from a stochastic 
approximation procedure, where at each iteration, the current estimate is updated based on 
a subsample of size m dr awn from the full da ta. As m is much smaller than n, the method 
is scalable to big data. iLiang et al.l (1201311 established the consistency and asymptotic 
normality of the resulting estimator under mild conditions. In a simulation study, the 
convergence rate of the method was almost independent of n, the sample size of the full 
data. 


2.1.4 Subsampling-Based MCMC 

As a popular general purpose tool for Bayesian inference, Markov chain Monte Carlo 
(MCMC) for big data is challengi ng because of the prohi bitive cost of likelihood evaluation 
of every datum at every iteration. Liang and Kiml ( 2013h extended the mean log-likelihood 
method to a bootstrap Metropolis-Bastings (MB) algorithm in MCMC. The likelihood ra¬ 
tio of the proposal and current estimate in the MB ratio is replaced with an approximation 
from the mean log-likelihood based on k bootstrap samples of size m. The algorithm can be 
implemented exploiting the e mbarrassingly parallel structu re and avoids repeated scans of 
the full dataset in iterations. Maclaurin and Adams ( 2014J) proposed an auxiliary variable 
MCMC algorithm called Firefly Monte Carlo (FlyMC) that only queries the likelihoods of a 
potentially small subset of the data at each iteration yet simulates from the exact posterior 
distribution. For each data point, a binary auxiliary variable and a strictly positive lower 
bound of the likelihood contribution are introduced. The binary variable for each datum 
effectively turn on and off data points in the posterior, hence the “hrefly” name. The 
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probability of turning on each datum depends on the ratio of its likelihood contribution 
and the introduced lower bound. The computational gain depends on that the lower bound 
is tight enough and that simulation of the auxiliary variables is cheap enough. Because of 
the need to hold the whole data in computer memory, the size of the data this method can 
handle is limited. 

The pseudo-marginal Metropolis-Hasting algorithm replaces the i ntractable target (pos¬ 
terior ) density in the MH algorithm with an unbiased estimator flAndrieu and Robertsl . 
20091) . The the log-likelihood is estimated by an nnbiased snbsampled version, and an un¬ 
biased estimator o f the likelihoo d is ob tained by correcting the bias of the exponentiation 
of this estimator. Quiroz et ah ( 20141) proposed subsampling the data using probability 
proportional-to-size (BPS) sampling to obtain an approximately unbiased estimate of the 
likelihood which i s used in the M-H a cceptance step. The subsampling approach was fur¬ 
ther improved in lOniroz. et al.l (120151) using the efficient and robust difference estimator 


form the survey sampling literature. 

2.2 Divide and Conquer 

A divide and conquer algorithm (which may appear under other names such as divide and 
recombine, split and conquer, or split and merge) generally has three steps: 1 ) partitions 
a big dataset into K blocks; 2) processes each block separately (possibly in parallel); and 
3) aggregates the solutions from each block to form a hnal solution to the full data. 


2.2.1 Aggregated Estimating Equations 

For a linear regression model, the least squares estimator for the regression coefficient f3 
for the full data can be expressed as a weighted average of the least squares estimator for 
each block with weight being the inverse of the estimated variance matrix. The success 
of this method for linear regression depends on the linearity of the estimating equations 
in 13 and that the estimating equation for the full data is a simple summatio n of that for 
all the blocks. For general nonlinear estimating equations, iLin and Xil (120111) proposed a 
linear approximation of the estimating equations with the Taylor expansion at the solution 
in each block, and, hence, rednce the nonlinear estimating equation to the linear case so 
that the solutions to all the blocks are combined by a weighted average. The weight of each 
block is the slope matrix of the estimating function at the solution in that block, which is 
the Fisher i n forma tion or inverse of the variance matrix if the equations are score equations. 
Lin and Xil (120111) showed that, nnder certain technical conditions including K = 0{n^) 
for some 7 G (0,1), the aggregated estimator has the same limit as the estimator from the 
full data. 


2.2.2 Majority Voting 


Chen and Xid (120141) consider a divide and conquer approach for generalized linear models 
(GLM) where both the sample size n and the number of covariates p are large, by incor¬ 
porating variable selection via penalized regression into a subset processing step. More 
specihcally, for p bounded or increasing to inhnity slowly, {pn not faster than o(e”'=), while 
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model size may increase at a rate of o(nfc)), they propose to first randomly split the data 
of size n into K blocks (size Uk = 0{n/K)). In step 2, penalized regression is applied 
to each block separately with a sparsity-inducing penalty function satisfying certain reg¬ 
ularity conditions. This approach can lead to differential variable selection among the 
blocks, as different blocks of data may result in penalized estimates with different non-zero 
regression coefficients. Thus, in step 3, the results from the K blocks are combined by 
majority vote to create a combined estimator. The implicit assumption is that real effects 



and separate analyses to be combined in a meta-analysis. The authors show under certain 
regularity conditions that their combined estimator in step 3 is model selection consistent, 
asymptotically equivalent to the penalized estimator that would result from using all of the 
data simultaneously, and achieves the o racle property wh en it is attainable for the penal¬ 
ized estimator from each block (see e.g.. Fan and Lv . f20li]) . They additionally establish an 


upper bound for the expected number of incorrectly selected variables and a lower bound 
for the expected number of correctly selected variables. 


2.2.3 Screening with Ultrahigh Dimension 


Instead of dividing the data into blocks of observations in step 1, ISong and Liana (120141) 
proposed a split-and-merge (SAM) method that divides the data into subsets of covariates 
for variable selection in ultrahigh dimensional regression from the Bayesian perspective. 
This method is particularly suited for big data where the number of covariates is much 
larger than the sample size n, S> n, and possibly increasing with n. In step 2, Bayesian 
variable selection is separately performed on each lower dimensional subset, which facilitates 
parallel processing. In step 3, the selected variables from each subset are aggregated, and 
Bayesian variable selection is applied on the aggregated data. The embarrassingly parallel 
structure in step 2 makes the SAM method applicable to big data problems with millions or 
more predictors. Posterior consistency is established for correctly specihed models and for 
misspecihed models, the latter of which is necessary because it is quite likely that some true 
predictors are missing. With correct model specihcation, true covariates will be identihed 
as the sample size becomes large; under misspecihed models, all predictors correlated with 
the response vari able will be i denti hed. Compared with the sure independence screening 
(SIS) approach flFan and LvI . 120081) . the method uses the joint information of multiple 
predictors in predictor screening while SIS only uses the marginal information of each 
predictor. Their numerical results show that the SAM approach outperforms competing 
methods for ultrahigh dimensional regression. 


2.2.4 Parallel MCMC 

In the Bayesian framework, it is natural to partition the data into K subsets and run paral¬ 
lel MCMC on each one of them. The prior distribution for each subset is often obtained by 
taking a power 1 /iC of the prior distribution for whole data in order to preserve the total 
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amount of prior information (which may change the impropriety of the prior). MCMC 
is run independently on each subset with no communications between subsets (and, thus, 



which produces the approximated full data posterior using weighted averages over the sub¬ 
set MCMC samples. The weight used (for Gaussian models) for each subset is the inverse 
of the variance-covariance matrix of the MCMC samples. The method is effective when the 
posterior is close to Gaussian but may cause bias when the distribution is skewed or has 
multi-mode s. The consensus Monte Carlo principal is approached from a variational per¬ 
spective by Rabinovich et all ( 2015 1. The embarrassingly parallel feature of these methods 


facilitates their implementat ion in the MapReduce fram ework that exploits the division 
and recombination strategy ( Dean and Ghemawat . lio^). The hnal recombination step is 


implemented in R package parallelMCMCcombine flMiroshnikov and Conlonl . 1201411 . 


Going beyond embarrassingly parallel MCMC remains challenging because of storage is¬ 
sues and communication o verheads. G e neral strategies for parallel MCM C such as ni u ltiple- 
proposal MH algorithm f Calderhe^ . 2014 1 and population MCMC f Song et 20141) 


mostly require full data at each node. 


2.3 Online Updating for Stream Data 

In some applications, data come in streams or large chunks, and a sequentially updated 
analysis is desirable without s toring the data. Mot ivated froni a Bay esian inference per¬ 
spective, ISchifano et al.l (1201511 extends the work of iLin and Xil (120111) in a few important 
ways. First, they introduce divide-and-conquer-type variance estimates of regression pa¬ 
rameters in the linear model and estimating equation settings. These estimates of vari¬ 
ability allow for users to make inferences about the true regression parameters based upon 
the previously developed divide-and-conquer point estimates of the regression parameters. 
Second, they develop iterative estimating algorithms and statistical inferences for linear 
models and estimating equations that update as new data arrive. Thus, while the divide- 
and-conquer setting is quite amenable to parallel processing for each subset, the online¬ 
updating approach for data streams is inherently sequential in nature. Their algorithms 
were designed to be computationally efficient and minimally storage-intensive, as they as¬ 
sume no access/storage of the historical data. Third, the authors address the issue of 
possible rank dehciencies when dealing with blocks of data, and the uniqueness properties 
of the combined and cumulative estimators when using a generalized inverse. The authors 
also provide methods for assessing goodness of ht in the linear model setting, as standard 
residual-based diagnostics cannot be performed with the cumulative data without access to 
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historical data. Instead, they propose outlier tests relying on predictive residuals, which are 
based on the predictive values computed from the cumulative estimate of the regression 
coefficients attained at the previous accumulation point. Additionally, they introduce a 
new online-updated estimator of the regression coefficients and corresponding estimator of 
the standard error in the estimating equation setting that takes advantage of information 
from the previous data. They show theoretically that this new estimator, the cumulative 
updated estimating equation (CUEE) estimator, is asymptotically consistent, and show 
empirically that the CUEE estimator is less biased in their finite sample simulations than 
the cumulatively estimated version of the estimator of Lin and 3 feniih . 


3 Criterion-Based Variable Selection with Online Up¬ 
dating 

To the best of our knowledge, criterion-based variable selection has not yet been considered 
in the online updating context. This problem is well worth investigating especially when 
access/storage of the historical data is limited. Suppose that we have K blocks of data in a 
sequence with Y^, X^, and Uk being the nfc-dimensional vector of responses, the Uk x (p-|-l) 
matrix of covariates, and the sample size, respectively, for the kth block, k = 1,... ,K, such 
that Y = (Y{, Y 2 , ..., and X = (X'j^,..., X'^.)'. Consider the standard linear regression 
model for the whole data with sample size n = 

Y = X/3 + e. 


where (3 is the regression coefficient vector, and e is a normal random vector with mean 
zero and variance 9In- Let M denote the model space. We enumerate the models in 
Ad by m = 1,2, ...,2^, where 2^ is the dimension of M. For the full model, the least 
squares estimate of /3 and the sum of squared errors based on the /cth subset is given by 

h 


nk,k 


= (X^Xfc) X'^Yfc and SSEn^,,k- In fhe sequent ial setting, we only n eed to store and 


update the cumulative estimates at each k (see, e.g. 
Let/3y> = (/3r>. 

all data through subset k for model m, where p. 


Schifano et al.l. l2015l) 


and denote the cumulative estimates based on 


-mi; • • • 

,, denotes a vector 


is the number of covariates for model m. 
We further introduce the (p-|-l)x(pm-|-l) selection matrix = (cmo, Cmi, • • • 
where em^ is a vector with length (p-|-l) and the first element as 1, and e, 
of length (p-l-l) with 1 in the m^-th position and 0 in every other position for all j > 0. Here 
(mi, ...,mp^) are not necessarily in sequence, but represents the index of selected variables 
in the full design matrix Xk. Define X^™'^ = XfcP^™'^ Update a {pm + 1) x {pm + 1) matrix 

yim) _ ^(m)'^(m) y{m) 

k k k k—1 ^ 


where = 0, and a {pm -|- 1) x 1 vector 







where /3 


(m) 

0 


0. After some algebra, we have 





iyr) 


and 



SSE„.i + 

-^rvw/sr’+ssEK, 


With a unknown, letting 



nlog 


27rSSEir^^ 

n - Pm - 1 ’ 


the Akaike information criterion (AIC) and Bayesian information criterion (BIC) are up¬ 
dated by 


AlCy* = £!<’">+n + p„+l, 

^ b ‘'^^ + n - - 1 + (p„^ + 1 ) logTz. 


To study the Bayesian variable selection criteria, assume a joint conjugate prior for 
as follows: follows normal distribution with mean and precision 

matrix Vq, follows Inverse Gamma distribution with shape parameter vq/2 and scale 
parameter ro/2, e.g, 

= ,r(/3<’”'|9l”l. Mo, Vo)ir(9<"‘Vo, To), 


where /Xq is a prespecihed (pm -|- l)-dimensional vector, Vq is a (pm -|- 1) x (pm + 1) positive 
dehnit e matrix, un > 0, 'Tn > 0- It can be shown that the deviance information criterion 
(DIG) f Spiegelhalter et ah . 2nn2h is updated by 


=nlog^^^^^^^—- y2n'ip(^)+ 2pm + n +A, 

where = dlogr(a:)/dr is the digamma function. 

We examined the performance of AIC, BIC and DIG under the online updating scenario 
in a simulation study. Each dataset was generated from linear model p* = x'/3-|-ei, where e^s 
were independently generated from A^(0,100), Xi = (1, xa, a:i 2 , a^is, 2 : 14 ) were identically dis¬ 
tributed random vectors from a multivariate normal distribution with mean (1, 0, 0, 0, 0) and 
marginal variances (0,16, 9, 0.3, 3). Two correlation structures of (xq, Xj 2 , Xi^) were con¬ 
sidered: 1) independent and 2) AR(1) with correlation coefficient 0.9. Four different models 
as determined by the nonzeroness of (3 were considered: (—1, 3, 0, 0, 0), (—1, 3, 0, —1.5, 0), 
(—1, 3, 2, —1.5, 0), and (—1, 3, 2, —1.5,1). The corresponding signal-to-noise ratios were 
1.44, 1.45, 1.81, and 1.83 in the independent case and 1.44, 1.29, 2.85, and 3.33 under the 
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Table 1: Percentages of the simulations that identify the variables indicated on the left for 
various number of blocks (k), subset sample sizes {uk = 100) and correlation within the 
design matrix X (independent or dependent). 


True 

Model 




independent 







dependent 








k = 2h 



k = 100 



k = 2 



k - 25 



k = 100 


AIC 

BIG 

DIG 

AIG 

BIG 

DIG 

AIG 

BIG 

DIG 

AIG 

BIG 

DIG 

AIG 

BIG 

DIG 

AIG 

BIG 

DIG 

a = (- 1 , 3 , 0.0 

, 0 ), signal-to- 

noise ratios ; 

are 1.44 for both independent and dependent. 








none 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

(xi) 

59 

93 

59 

60 

98 

60 

59 

99 

59 

63 

94 

62 

64 

99 

64 

64 

99 

64 

^2) 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

(2:1,2:2) 

11 

2 

11 

11 

1 

11 

12 

0 

12 

10 

2 

10 

9 

1 

9 

10 

0 

10 

(2^3) 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

(21.23) 

11 

2 

11 

11 

1 

11 

11 

0 

11 

8 

2 

8 

8 

0 

8 

8 

0 

8 

(22,23) 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

(21,22,23) 

2 

0 

3 

2 

0 

2 

2 

0 

2 

4 

0 

4 

3 

0 

3 

3 

0 

3 

(24) 

0 
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0 
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0 
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(21,24) 

11 

2 

11 

11 

0 

11 

11 

0 

11 

9 

2 

9 

8 

0 

9 

8 

0 

8 

(22.24) 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

(21,22,24) 

2 

0 

2 

2 

0 

2 

2 

0 

2 

3 

0 

3 

3 

0 

3 

3 

0 

3 

(23,24) 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

(21,23,24) 

2 

0 

2 

2 

0 

2 

2 

0 

2 

4 

0 

4 

4 

0 

4 

4 

0 

4 

(22,23,24) 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

(21,22,23,24) 

1 

0 

1 

0 

0 

0 

0 

0 

0 

1 

0 

1 

1 

0 

1 

1 

0 

1 

0 = (-1,3,0.- 

■ 1 . 5 , 0 ) 

, signal-to-noise ratios are 1.45 for independent and 1.29 for dependent. 







none 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 
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dependent case. The sample size of each block was set as Uk = 100. The performance of 
the criteria was investigated with the cumulative estimates at block k G {2,25,100}. For 
each scenario, 10,000 independent datasets were generated. 

The percentages of models selected among the 2^ models by each of the three criteria 
are summarized in Table [T] The entire row in bold represents the true model. Based 
on the simulation results, BIG performs extremely well when the number of blocks (/c) 
is large, which is consistent with known results that the proba b ility of select ing the true 
model by BIG approaches 1 as n —)■ oo (e.g., SchwarzI . 1978 : Nishiil. 1984 1. The BIG 
also performs better than AIG and DIG when the covariates are independent, even for 
small sample sizes. When covariates are highly dependent, AIG and DIG provide more 
reliable results when sample size is small. The performance of AIG and DIG is always very 
simil ar. The simuDtion results also conhrm the existing theorem that AIG is not consistent 
(e.g., Woodroo^ 1982ll . In the big data setting with large sample size, BIG is generally 
preferable, especially when the covariates are not highly correlated. 


4 Open Source R and R Packages 

Handling big data is one of the topics of high performance computing. As the most popular 
open source statistical software, R and its adds-on packages provide a wide range of high 
performance computing; see Gomprehensive R Archi ve Network (GRAN ) task view on 
“High-Performance and Parallel Gomputing with R” ( Eddelbuettei 2014 1. The focus of 
this section is on how to break the computer memory barrier and the computing power 
barrier in the context of big data. 


4.1 Breaking the Memory Barrier 

The size of big data is relative to the available computing resources. The theoretical limit 
of random access memory (RAM) is determined by the width of memory addresses: 4 
gigabyte (GB) (2^^ bytes) for a 32-bit computer and 16.8 million terabyte (2®^ bytes) for 
a 64-bit computer. In practice, however, the latter is limited by the physical space of a 
computer case, the operating system, and specihc software. Individual objects in R have 
limits in size too; an R u ser can hardly work with any object of size close to that limit. 
Emerson and Kanel fl2012ll suggested that a data set would be considered large if it exceeds 
20% of RAM on a given machine and massive if it exceeds 50%, in which case, even the 
simplest calculation would consume all the remaining RAM. 

Memory boundary can be broken with an external memory algorithms (EMA) (e.g., 
Vitterl. 2001), which conceptually works by storing the data on a disk storage (which has 
a much greate r limit than RAM), and processing one chunk of it at a time in RAM (e.g., 
Lumlevi. 120131) . The results from each chunk will be saved or updated and the process 
continues until the entire dataset is exhausted; then, if needed as in an iterative algorithm, 
the process is reset from the beginning of the data. To implement an EMA for each 
statistical function, one need to address 1) data management and 2) numerical calculation. 
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4.1.1 Data Management 


Earlier solutions to oversize data resorted to relational databases. This method depends on 
an external database management system (DBMS) such as MySQL, PostgreSQL, SQLite, 
H2, ODBC, O racle, and others. In terface s to R are provided through many R package s 
such as s qldf (Grothendieck . 2014 ). DBI (Ir Special Interest Group on Databases . 2014 1. 


RSQLite fj Wickham et al.l. l2014l) . and others. The database approach requires a DBMS 


to be installed and maintained, and knowledge of str uctured quer y language (SQL); an 
exception for simpler applications is package filehash fjPend . l2006l) . which comes with a 
simple key-value database implementation itself. The numerical functionality of SQL is 
quite limited, and calculations for most statistical analyses require copying subsets of the 
data into objects in R facilitated by the interfaces. Extracting chunks from an external 
DBMS is computa t ional ly much less efficient than the more recent approaches discussed 


below ( Kane et ah . 2013ll 


Two R packages, bigmemory ( Kane et al. . 2013 ) and ff ( Adler et al.l. 2014j) provide data 
structures for massive data while retaining a look and feel of R objects. Package bigmemory 
dehnes a data structure big.matrix for numeric matrices which uses memory-mapped hies 
to allow matrices to exceed the RAM size on computers with 64-bit operating systems. The 
underling technology is memory mapping on modern operating systems that associates a 
segment of virtual memory in a one-to-one correspondence with contents of a hie. These hies 
are accessed at a much faster speed than in the database approaches because operations are 
handled at the operating-system level. The big.matrix structure has several advantages 
such as support of shared memory for efficiency in parallel computing, reference behavior 
that avoids unnecessary temporary copies of massive objects, and column- major format 


that is compatible with legacy linear algebra packages (e.g., BLAS, LAPACK) flKane et al. 


20131) . The package provides utility to read in a csv hie to form a big.matrix object, but 


it only allows one type of data, numeric; it is a numeric matrix after all. 

Package ff provides data structures that are stored in binary hat hies but behave (al¬ 
most) as if they were in RAM by transparently mapping only a section (pagesize) of meta 
data in main memory. Unlike bigmemory, it supports R’s standard atomic data types (e.g., 
double or logical) as well as nonstandard, storage efficient atomic types (e.g., the 2-bit un¬ 
signed quad type allows efficient storage of genomic data as a factor with levels A, T, G, 
and, C). It also provides class ffdf which is like data.frame in R, and import/export hlters 
for CSV hies. A binary hat hie can be shared by multiple f f objects in the same or multiple 
R processes for parallel access. Utility functions allow interactive process of selections of 
big data. 


4.1.2 Numerical Calculation 

The data management systems in packages bigmemory or ff do not mean that one can 
apply existing R functions yet. Even a simple statistical analysis such as linear model or 
survival analysis will need to be implemented for the new data structures. Chunks of big 
data will be processed in RAM one at a time, and often, the process needs to be iterated 
over the whole data. A special case is the linear model htting, where one pass of the data 
is sufficient and no resetting from the beginning is needed. Consider a regression model 
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E\Y] = X/3 with n X 1 response Y, nxp model matrix X and p x 1 coefficient [5. The base 
R implementation Im.fit takes 0(np + p^) memory, which can be reduced dramatically 
by processing in chunks. The hrst option is to compute X'X and X'y in increment, and 
get the lea st squares es timate oi (3, (3 = {X'X)~^X'Y. This method is adopted in package 
speedglm flEneal . I20141). A slower but more accurate option is to compute the incremental 
QR decomposition f Millerl . 1992 1 of X = QR to get R a nd Q'Y, and t hen solve (3 from 
R/3 = Q'Y. This option is implemented in package biglm f Tviimlevl . l2013l) . Function biglm 
uses only memory of p variables and the htted object can be updated with more data 
using update. The package also provides an incremental computation of sandwich variance 
estimator by accumulating a (p + 1)^ x (p + 1)^ matrix of products of X and Y without a 
second pass of the data. 

In general, a numerical calculation needs an iterative algorithm in computation and, 
hence, multiple passes of the data are necessary. For example, a GLM htting is often 
obtained through the iterated reweighted least squares (IRLS) algorithm. The bigglm 
function in package biglm implements the generic IRLS algorithm that can be applied to 
any specihc data management system such as DBMS, bigmemory, or ff, provided that a 
function data(reset = FALSE) supplies the next chunk of data or zero-row data if there 
is no more, and data(reset = TRUE) resets to the beginning of the data for the next 
iteration. Specihc implementation of the data f unction for object of class big.matrix 
and ffdf are prov ided in package biganalytics (IFmerson and Kand. l2013al) and ffbase 
fjjonge et al.l . 120141 ). respectively. 

For any statistical analysis on big data making use of the data management system, one 
would need to implement the necessary numerical calculations like what package biglm does 
for GLM. The family of bigmemory provides a collection of functions for big.matrix ob¬ 
jects: bi ganalytics for basic analytic and statistical functions, bigtabulate for tabulation op¬ 
erations flEmerson and Kanel. l2013blL and b igalgebra for matrix operation with the BLAS 
and LAPAGK libraries ( Kane et ah . 20141) . Some additional functions for big.matrix 


objects are available from other contributed pa ckages, such a s bigpca for principal com- 


est ( 

Lim et ah 

2914) 

(Jonee et ah 

r 

914). 


For ff objects, package ffbase provides basic statistical functions 
Additional functions for ff objects are provided in other packages, 


with examples including biglars for least angle regression and LA SSO (ISeligman et ah 


20 111) and PopGenome for population genetic and genomic analysis flPfeifer et al.l. l2014l) . 


If some statistical analysis, such as generalized estimating equations or Gox proportional 
hazards model, has not been implemented for big dat a, then one will ne ed to modify the 
existing algorithm to implement it. As pointed out bv iKane et ahl (120131 p.5), this would 
open Pandora’s box of recoding which is not a long-term solution for scalable statistical 
analyses; this calls for redesign of the next-generation statistical programming environment 
which could provide seamless scalability through hle-backed memory-mapping for big data, 
help avoid the need for specialized tools for big data management, and allow statisticians 
and developers to focus on new methods and algorithms. 
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4.2 Breaking the Computing Power Barrier 

4.2.1 Speeding Up 


As a high level interpreted language, for which most of instructions are executed directly, 
R is infamously slow with loops. Some loops can be avoided by taking advantage of the 
vectorized functions in R or by clever vectorizing with some effort. When vectorization is 
not straightforward or loops are unavoidable, as in the case of MCMC, acceleration is much 
desired, especially for big data. The least expensive tool in a programmer’s effort to speed 
up R code is to compile them to byte code with the compiler package, which was developed 
by Luke Tierney and is now part of base R. The byte code compiler translates the high-level 
R into a very simple language that can be interpreted by a very fast byte code interpreter, 
or virtual machine. Starting with R 2.14.0 in 2011, the base and recommended packages 
were pre-compiled into byte-code by default. Users’ functions, expressions, scripts, and 
packages can be compiled for an immediate boost in speed by a factor of 2 to 5. 

Computing bottlenecks can be implemented in a compiled language s uch as C/C-f-l- 
or FO RTRAN and interfaced to R through R’s foreign language interfaces flR Core Team! . 
2014bl. ch.5). Typical bottlenecks are loops, recursions, and complex data structures. Re- 


cent developinents h ave made the inte rfacing with C-|--|- m uch easier than it used to be 
f Eddelbuette i 2013h . Package inline f Sklvar et ah . 2013h provides functions that wrap 
C/C-I--I- (or FORTRAN) code as strings in R and takes care of compiling, linking, and 
loading by placing the resulting dynamically-loadable object code in t he per-session tem¬ 
porar y directory used by R. For more general usage, package Repp f Eddelbuettel et 
20111) provides C-|--|- classes for many basic R data typ es, which al l ow st raightforward 
passing of data in both directions. Package RepEigen ( Bates et ah . 20141) provides 


ac¬ 


cess to the high-performance linear algebra library Eigen for a wide variety of matrix 
methods, various decompositions an d support of sparse matrices. Package RcppArmadillo 
( Eddelbuettel and Sanderson . 20141) connects R with Armadillo, a powerful templated lin¬ 
ear algebra li brary which provides a good bala nce between speed and ease of use. Pack¬ 
age RInside ( Eddelbuettel and Francoii . 20141) gives easy access of R objects from C-|--|- 
by wrapping the existing R embedding application programming interface (API) in C-|--|- 
classes. The Repp project has revolutionized the integration of R with C-F-F; it is now used 
by hundreds of R packages. 

Diagnos t ic too ls can help identify the bottlenecks in R code. Package microbenchmark 
( Mersma.nn . 2014|) provides very precise timings for small pieces of source code, making 
it possible to compare operations that only take a tiny amount of time. For a collection 
of code, run-time of each individual operation can be measured with realistic inputs; the 
process is known as prohling. Function Rprof in R does the profil ing, but the outputs 
are n ot intuitive to understand for many users. Packages proftools (iTiernev and Jariour . 
2013) and aprof (Visserl . 20141) provide tools to analyze prohling out puts. Packages profr 
( Wickham . 2014b ). lineprof ( Wickham . 2014c ). and GUIProfller ( de Villar and Rubio . 


20141) provide visualization of prohling results. 
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4.2.2 Scaling Up 


The R package system has long embraced integration of parallel computing of various 
technologies to address the big data challenges. For embarrassingly parallelizable jobs 
such as bootstrap or simulation, where there is no dependency or communication be- 
tween parallel tasks , man y options are available with computer clusters or multicores. 
Schmidberger et ahl f 2009[) reviewed the then st ate-of-them rt parallel computing with R, 
highlighting two packages for cluster use: Rmpi flYul . 120021) which pr ovides an R in t erface 
to the Message Passing Interface (MPI) in parallel computing; snow (IRossini et al.l. 120071) 
which provides an abstract layer with the communication details hidden from the end 
users. Since t hen, some packages have been developed an d some discon tinued. Pack¬ 
ages snowFT f Sevcikova and Rossini . 2012a ') and snowfall f Knausl . 2013 ) extend snow 
with fault toler ance and wrapp ers for easier development of parallel R programs. Pack¬ 
age multicore fjUrbanekl. 120141) provides parallel processing of R code on machines with 
multiple cores or CPUs. Its work and some of snow have been incorporated into the 
base R package parallel, which was hrst included in R 2.14.0 in 2011. Package foreach 
(iRevolution Analytics and Westonl . 120141) allows general iteration over elements in a collec¬ 
tion without any explicit loop counter. Using foreach loop without side effects facilitates 
executing the loop in parallel with different parallel mechanisms, including those provided 
by parallel, Rmpi, and snow. For massive data that exceed the computer memory, a 
combination of foreach and bigmemory, with shared-memory data structure referenced by 
multiple processes, provides a framework with ease o f developmen t and efficiency of exe¬ 
cution (both in speed and memory) as illustrated bv iKane et al.l (120131) . Package Rdsm 
provides facilities for distributed shared memory parallelism at the R level, and combined 
with bigmemory, it enables parallel processing on massive, out-of-core matrices. 

The “Programming with Big Data in R” project (pbdR) enables high-level d istributed 


data parallelism in R with easy utilization of large clusters with thousands of cores fjOstrouchov et al 


2 OI 2 I) . Big data are interpreted quite literally to mean that a dataset requires parallel pro¬ 
cessing either because it does not fit in the memory of a single machine or because its 
processing time needs to be made tolerable. The project focuses on distributed memory 
systems where data are distributed across processors and communications between proces¬ 
sors are based on MPI. It consists of a collection of R packages in a hierarchy. Package 
pbdMPI provides S4 classes to directly interface with MPI to support the Single Program 
Multiple Data (SPMD) parallelism. Package pbdSLAP serves as a mechanism to utiliz e a 
subset of functions of scalable dense linear algebra in ScaLAPACK ( Blackford et ahl. 1997 ). a 
subset of LAPACK routines redesigned with the SPMD style. Package pbdBASE contains a 
set of wrappers of low level functions in ScaLAPACK, upon which package pbdMAT builds 
to provide distributed dense matrix computing while preserving the friendly and familiar 
R syntax for these computations. Demonstrations on how to use these and other packages 
from the pbdR are available in package pbdDEMO. 

A recent, widely adopted open source framework for massive data storage and dis¬ 
tributed computing is Hadoop ( The Apache Software Foundation . 2014bf) . Its heart is an 


imple mentation of the MapReduce programming model first developed at Google (jPean and Ghemawat 


20081 ). which divides the data to distributed systems and computes for each group (the map 
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step), and then recombines the results (the reduce step). It provides fault tolerant and scal¬ 
able storage of massive datasets across machines in a cluster fjWhitel . 120111 ). The model suits 
perfectly the embarrass ingly parallelizable jobs and th e distributed hie system helps break 
the memory boundary. McCallnm and Weston f 201ll. ch.5-8) demonstrated three ways to 
combine Hadoop and R. The hrst is to submit R scripts directly to a Hadoop cluster, which 
gives the user the most control and the most power, but comes at the cost of a Hadoop 
learning curve. The second is a pure R solution via package Rhipe, which hides the commu¬ 
nications to Hadoop from R users. The package (not on CRAN) is from the RHIPE proiect , 
which stands for R and Hadoop Integrated Programming Environment flGnha et al.l. l2012l) . 
With Rhipe, data analysts only need to write R code for the map step and the reduce step 
( Guha et al.l. l2012l) . and get the power of Hadoop without leaving R. The third approach 
target s spec ihcally the Elastic MapReduce (EMR) at Amazon by a CRAN package segue 
flLond. l2012l) . which makes EMR as easy to use as a parallel backend for lapply-style op¬ 
erations. An alternative open source project that connects R and H adoop is the RHadoop 
project, which is actively being developed by Revolution Analytics f Revolution Analvticsl . 


20141) . This project is a collection of R packages that allow users to manage and analyze 


data with Hadoop: rhbase provides functions for database management for the HBase dis¬ 
tributed database, rhdfs provides functions for Hadoop distributed hie system (HDFS), rmr 
provides functions to Hadoop MapReduce functionality, plymr provides higher level data 
processing for structured data, and the most recent addition ravro provides reading and 
writing fun ctions for hies in avro format, an efhcien t data serialization system developed 
at Apache ( The Apache Software Foundation!. 2014a ). 

Sp ark is a more recent, cousin project of Hado op that supports tools for big data related 
tasks flThe Apache Software Foundation!. 2014c ). The functions of Spark and Hadoop are 
neither the exactly same nor mutually exclusive, and they often work together. Hadoop 
has its own distributed storage system, which is fundamental for any big data computing 
framework, allowing vast datasets to be stored across the hard drives of a scalable computer 
cluster rather than on a huge costly hold-it-all device. It persists back to the disk after 
a map or reduce action. In cont rast, Spark does not have its own distributed hie system, 
and it processes data in-memory (jZaharia et al.l. 120101) . The biggest diherence is disk-based 
computing versus memory-based computing. This is why Spark could work 100 times faster 
than hadoop for some applications when the data ht in the memory. Some applications 
such as machine learning or stream processing where data are repeatedly queried makes 
Spark an ideal framework. For big data that does not ht in memory. Spark’s operators 
spill data to disk, allowing it to run well on any sized data. For this purpose, it can be 
installed on top Hadoop to take advantage of Hadoop ’s HDFS. An R frontend to Spark is 
provided in R package SparkR fj Venkataram an! 120131) . which has become part of Apache 
Spark recently. By using Spark’s distributed computation engine, the package allows users 


to run large scale data analysis such as selection, hltering, aggregation from R. iKarau et al 
(120151) provides a summary of the state-of-the-art on using Spark. 

As multicores have become the standard setup for computers today, it is desirable to 
automatically make use of the cores in implicit parallelism without any explicit requests 
from the user. The experimental packages pnmath and pnmathO by Luke Tierney replace 
a number of internal vector operations in R with alternatives that can take advantage 
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of multicores flTiernevl . 120091 ). For a serial algorithm such as MCMC, it is desirable to 
parallelize the computation bottleneck if possible, but this generally inv olves learni n g new 
computing tools and the debugging can be challengi ng. For instance. lYan et al.l (120071) 
used the parallel linear algebra package (PLAPACK) ( van de Geiinl. 1997 ) for the matrix 
operations (especially the Cholesky decomposition) in a MCMC algorithm for Bayesian 
spatiotemporal geostatistical models, but the scalability was only moderate. 

When random numbers are involved as in the case of simulation, extra care is needed 
to make sure the parallelized jobs run indep endent (and preferably reproducible) random- 
number streams. Package rsprng (hj, l2010l) provides an interface to th e Scalable Parallel 
Random Number Generators (S PRNG) (iMascagni and Srinivasanl. 2000 ). Package rlecuyer 
( Sevcikova and Rossinil . 2012b[) provides an i nterface to the random number generator with 
multiple independent streams developed by L’Ecuver et ah ( 20021) . the ideas of which are 
also implemented in the base package parallel: make independent streams by se parating a 
single stream with a sufficiently large number of steps apart. Package doRNG ( Gauiou:^. 


20141) provides functions to perform reproducible parallel foreach loops, independent of 


the parallel environment and associated foreach backend. 

From a hardware perspective, many computers have mini clusters of graphics processing 
units (GPUs) that can help with bottlenecks. GPUs are dedicated numerical processors 
that were originally designed for rendering three dimensional computer graphics. A GPU 
has hundreds of processor cores on a single c hip and can be progra mmed to apply the same 
numerical operations on large data array. Suchard et ah ( 20101 ) investigated the use of 
GPUs in massively parallel massive mixture modeling, and showed better performance of 
GPUs than multicore GPUs, especially for larger samples. To reap the advantage, however, 
one needs to learn the related tools such as Gompute Unihed Device Architecture (GUDA), 
Open Gor nputing Language (O penGL), and so on, which may be prohibitive. An R package 
gputools ( Buckner et ah . 20131) provides interface to NVidia GUDA toolkit and others. 

If one is willing to step out of the comfort zone of R and take full control/responsibility 
of parallel computing, one may program with open source MPI or Open Multi-Processing 
(OpenMP). MPI is a language-independent communication system designed for p rogram¬ 
ming on parallel computers, targeting high performance, scalability and portability (jPachecol . 


19971 ). Most MPI implementations are available as libraries from C/C-I--I-, FORTRAN, and 


any language that can interface with such libraries, including C^, Java or Python. The in¬ 
terface from R can be accessed with package Rmpi (lYul . l2002l) as mentioned earlier. Freely 
available implementations include OpenMPI (not OpenMP) and MPIGH, while others come 
with license such as Intel MPI. OpenMP is an API that supports multi-platform shared 
memory multiprocessing programm ing in C/C-FJ- and FO RTRAN on most processor ar¬ 


chitectures and operating systems (iGhapman et al.l . 120081) . It is an add on to compilers 


(e.g., gcc, intel compiler) to take advantage of of shared memory systems such as multi¬ 
core computers where processors shared the main memory. MPI targets both distributed 
as well as shared momory systems while OpenMP targets only shared memory systems. 
MPI provides both process and thread based approach while OpenMP provides only thread 
based parallilism. OpenMP uses a portable, scalable model that gives programmers a sim- 
ple and flexible interface for writing multi-threaded programs in C/C++ and FORTRAN 
( Dagum and Enon . 19981) . Debugging parallel programs can be very challenging. 
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5 Commercial Statistical Software 


RRE is the core product of Revolution Analytics (formerly Revolution Computing), a com¬ 
pany that provides R tools, support, and training. RRE focuses on big data, large scale 
multiprocessor (or high performance) computing, and multicore functionality. Massive 
datasets are handled via EMA and parallel EMA (PEMA) when multiprocessors or mul- 
ticores are available. The commercial package RevoScaleR flRevolution Analvticsl . 120131) 
breaks the memory boundary by a special XDF data format that allows efficient storage and 
retrieval of data. Functions in the package (e.g., rxGlm for GLM htting) know to work on 
a massive dataset one chunk at a time. The computing power boundary is also addressed 
— functions in the package can exploit multicores or computer clusters. Packages from the 
aforementioned open source project RHadoop developed by the company provide support 
for Hadoop. Other components in RRE allow high speed connection for various types of 
data sources and threading and inter-process communication for parallel and distributed 
computing. The same code works on small and big data, and on workstations, servers, clus¬ 
ters, Hadoop, or in the cloud. The single workstation version of RRE is free for academic 
use currently, and was used in the case study in Section O 

SAS, one of the most widely used commercial software for statistical analysis, provides 
big data support through SAS High Performance Analytics. Massive datasets are ap¬ 
proached by grid computing, in-database processing, in-memory analytics and connection 
to Hadoop. The SAS High Performance Analytics Products include statistics, econometrics, 
optimization, forecasting, data mining, and text mining, which, respectively, correspond to 
SAS p roducts STAS, ETS, OR, hig h-performance forecasting, enterprise miner, and text 
miner (jCohen and Rodriguezl. 120131) . 

IBM SPSS, the Statistical Product and Services Solution, provides big data analytics 
through SPSS Modeler, SPSS Anal ytic S e rver, SPSS Collaboration and Deployment Ser¬ 
vices, and SPSS Analytic Catalyst HBMI. 2014 ). SPSS Analytic Server is the foundation 
and it focuses on high performance analytics for data stored in Hadoop-based distributed 
systems. SPSS modeler is the high-performance data mining workbench, utilizing SPSS 
Analytic Server to leverage big data in Hadoop environments. Analysts can dehne analy¬ 
sis in a familiar and accessible workbench to conduct analysis modeling and scoring over 
high volumes of varied data. SPSS Collaboration and Deployment Services helps man¬ 
age analytical assets, automate processes and efficiently share results widely and securely. 
SPSS Analytic Catalyst is the automation of analysis that makes analytics and data more 
accessible to users. 

MATLAB provides a num ber of tools to tackle the challenges of big data analytics 


(iThe MathWorks. Inc.l . 120141) . Memory mapped variables map a hie or a proportion of 


a hie to a variable in RAM; disk variables direct access to variables from hies on disk; 
datastore allows access to data that do not ht into RAM. Their combination addresses 
the memory boundary. The computation power boundary is broken by intrinsic multicore 
math, GPU computing, parallel computing, cloud computing, and Hadoop support. 
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Table 2: Timing results (in seconds) for reading in the whole 12GB data, transforming to 
create new variables, and htting the logistic regression with three methods: bigmemory, 
fF, and RRE. 



Reading 

Transforming 

Fitting 

bigmemory 

968.6 

105.5 

1501.7 

fF 

1111.3 

528.4 

1988.0 

RRE 

851.7 

107.5 

189.4 


6 A Case Study 


The airline on-time performance data from the 2009 ASA Data Expo (http: //stat-computing. org/datac 
is used as a case study to demonstrate a logistic model htting with a massive dataset 
that exceeds the RAM of a single computer. The data i s publ icly available and has 
been used for demonstration with big data by Kane et'aP ( 20131 ) and others. It con¬ 
sists of hight arrival and departure details for all commercial hights within the USA, 
from October 1987 to April 2008. About 12 million hights were recorded with 29 vari¬ 
ables. A compressed version of the pre-processed data set from the bigmemory project 
(http://data.jstatsoft.org/v55/il4/Airline.tar.bz2) is approximately 1.7GB, and 
it takes 12GB when uncompressed. 

The response of the logistic regression is late arrival which was set to 1 if a hight was 
late by more than 15 minutes and 0 otherwise. Two binary covariates were created from 
the departure time: night (1 if departure occurred between 8pm and 5am) and weekend 
(1 if departure occurred on weekends and 0 otherwise). Two continuous covariates were 
included: departure hour (DepHour, range 0 to 24) and distance from origin to destination 
(in 1000 miles). In the raw data, the departure time was an integer of the HHmm format. 

It was converted to minutes hrst to prepare for DepHour. Three methods are considered 
in the case study: 1) combination of bigglm with package bigmemory; 2) combination 
of bigglm with package fF; and 3) the academic, single workstation version of RRE. The 
default settings of fF were used. Before htting the logistic regression, the 12GB raw data 
needs to be read in from the csv format, and new variables needs to be generated. This 
leads to a total of 120, 748, 239 observations with no missing data. The R scripts for the 
three methods are in the supplementary materials for interested readers. 

The R scripts were executed in batch mode on a 8-core machine running GenOS (a 
free Linux distribution functionally compatible with Red Hat Enterprise Linux which is 
officially supported by RRE), with Intel Gore i7 2.93GHz GPU, and 16GB memory. Table [2] 
summarizes the timing results of reading in the whole 12GB data, transforming to create 
new variables, and htting the logistic regression with the three methods. The chunk sizes 
were set to be 500,000 observations for all three methods. For RRE, this was set when 
reading in the data to the XDF format; for the other two methods, this was set at the htting 
stage using function bigglm. Under the current settings, RRE has a clear advantage in 
htting with only 8% of the time used by the other two approaches. This is a result of 
the joint force of its using all 8 cores implicitly and efficient storage and retrieval of the 
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Table 3: Logistic regression results for late arrival. 



Estimate 

Std. Error (xlO^) 

(Intercept) 

-2.985 

9.470 

DepHour 

0.104 

0.601 

Distance 

0.235 

4.032 

Night 

-0.448 

8.173 

Weekend 

-0.177 

5.412 


Table 4: Time results (in seconds) for parallel computing quantiles of departure delay for 
each day of the week with 1 to 8 cores using foreach. 



1 

2 

3 

4 

5 

6 

7 

8 

bigmemory 

22.1 

11.2 

7.8 

6.9 

6.2 

6.3 

6.4 

6.8 

ff 

21.4 

11.0 

7.1 

6.7 

5.8 

5.9 

6.1 

6.8 


data; the XDF version of the data is about 1/10 of the size of the external files saved by 
bigmemory or ff. Using bigmemory and using ff in bigglm had very similar performance in 
fitting the logistic regression, but the former took less time in reading, and significantly less 
time (only about 1/5) in transforming variables of the latter. The bigmemory method was 
quite close to the RRE method in the reading and the transforming tasks. The ff method 
took longer in reading and transforming than the bigmemory method, possibly because it 
used much less memory. 

The results of the logistic regression are identical from all methods, and are summarized 
in Table [3l Flights with later departure hour or longer distance are more likely to be delayed. 
Night flights or weekend flights are less likely to be delayed. Given the huge sample size, all 
coefficients were highly signihcant. It is possible, however, that p-values can still be useful. 
A binary covariate wi th very low rate of ev ent may still have an estimated coefficient with 
a not-so-low p-value fISchifano et al.l . 120151) . an effect only estimable with big data. 

As an i ll ustrat ion of foreach for embarrassingly parallel computing, the example in 
Kane et ah ( 20131) is expanded to include both bigmemory and ff. The task is to find 


three quantiles (0.5, 0.9, and 0.99) of departure delays for each day of the week; that is, 
7 independent jobs can run on 7 cores separately. To make the task bigger, each job was 
set to run twice. The resulting 14 jobs were parallelized with foreach on the same Linux 
machine using 1 to 8 cores for the sake of illustration. The R script is included in the 
supplementary materials. The timing results are summarized in Table IH There is little 
difference between the two implementations. When there is no communication overhead, 
with 14 jobs one would expect the run time to reduce to 1/2, 5/14, 4/14, 3/14, 3/14, 2/14, 
and 2/14, respectively, with 2, 3, 4, 5, 6, 7 and 8 cores. The impact of communication cost 
is obvious in Table 01 The time reduction is only closer to the expectation in the ideal case 
when the number of cores is smaller. 
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7 Discussion 


This article presents a recent snapshot on statistical analysis with big data that exceed 
the memory and computing capacity of a single computer. Albeit under-appreciated by 
the general public or even mainstream academic community, computational statisticians 
have made respectable progress in extending standard statistical analysis to big data, with 
the most notable achievements in the open source R community. Packages bigmemory 
and fF make it possible in principle to implement any statistical analysis with their data 
structure. Nonetheless, for anything that has not been already implemented (e.g., survival 
analysis, generalized estimating equations, mixed effects model, etc.), one would need to 
implement an EMA version of the computation task, which may not be straightforward 
and may involve some steep learning curves. Hadoop allows easy extension of algorithms 
that do not require multiple passes of the data, but such analyses are mostly descriptive. 
An example is visualization, an important tool in exploratory analysis. With big data, the 
bottleneck is the number of pixels in th e screen . The bin-summarize- smooth framewo rk for 
visualization of large data of Wickham (2014a) with package bigvis f Wickhainl. 2013 ) may 
be adapted to work with Hadoop. 

Big data present challenges much further beyond the territory of classic statistics, re quir¬ 
ing jo int workforce with domain knowledge, computing skills, and statistical thinking flYul . 


20141) . Statisticians have much to contribute to both the intellectual vitality and the prac¬ 


tical utility of big data, but will have to expand their comfort zone to eng age high-impact, 
real w orld problems which are often less structured or with ambiguity flJordan and Liru . 


20141) . Examples are to provide structure for poorly dehned problems, or to deve lop ni eth- 


ods/models for new types of data such as image or network. As suggested by[Yu| (120141) . to 
play a critical role in the arena of big data or own data science, statisticians need to work 
on real problems and relevant methodology and theory will follow naturally. 
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Supplementary Materials 

Four R scripts (and their outputs), along with a descriptive README hie are provided for 
the case study. The hrst three are the logistic regression with, respectively, combination of 
bigmemory with bigglm (bigmemory .R), combination of flf with bigglm (ff .R), and RRE 
(RevR.R); their output hies have .Rout extensions. The hrst two run with R, while the 
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third one needs RRE. The fourth script is for the parallel computing with foreach combined 
with bigmemory and fF, respectively. 
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